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Abstract 

In many graphics applications, the computation of exact geodesic 
distance is very important. However, the high computational cost 
of the existing geodesic algorithms means that they are not practi- 
cal for large-scale models or time-critical applications. To tackle 
this challenge, we propose the parallel Chen-Han (or PCH) algo- 
rithm, which extends the classic Chen-Han (CH) discrete geodesic 
algorithm to the parallel setting. The original CH algorithm and 
its variant both lack a parallel solution because the windows (a 
key data structure that carries the shortest distance in the wave- 
front propagation) are maintained in a strict order or a tightly cou- 
pled manner, which means that only one window is processed at 
a time. We propose dividing the CH's sequential algorithm into 
four phases, window selection, window propagation, data organi- 
zation, and events processing so that there is no data dependence 
or conflicts in each phase and the operations within each phase can 
be carried out in parallel. The proposed PCH algorithm is able 
to propagate a large number of windows simultaneously and inde- 
pendently. We also adopt a simple yet effective strategy to control 
the total number of windows. We implement the PCH algorithm 
on modern GPUs (such as Nvidia GTX 580) and analyze the per- 
formance in detail. The performance improvement (compared to 
the sequential algorithms) is highly consistent with GPU double- 
precision performance (GFLOPS). Extensive experiments on real- 
world models demonstrate an order of magnitude improvement in 
execution time compared to the state-of-the-art. 

Keywords: Discrete geodesic, parallel computation, window 
propagation, GPU 

1 Introduction 

Computing geodesies on triangle meshes, as a fundamental prob- 
lem in geometric modeling, has been widely studied since the mid- 
1980s. Over the years, several algorithms have been proposed 
to compute the "single-source-all-destination" geodesic distance, 
such as, th e MMP alg orithm [M itchell et al. 198 7 1, the CH algo- 
rithm I Chen and Han 1990| and the improved CH (or ICH) algo- 
rithm ^Xin and Wang 2009] , the fast marching method | Kimmel' 
and Sethian 1998| and the approximate MMP algorithm [Surazh- 
sky et al. 2005 1. While the state-of-the-art approaches [Surazhsky 
et al. 2005 1 |Xin and Wang 2009 1 work quite well for models of 
moderate size, their high computational cost means that they are 
not practical for large-scale models or time-critical applications. 

In the past decade, there has been an increasing trend of perform- 
ing the traditionally-CPU-handled computation on a graphics pro- 
cessing unit (GPU), which uses large numbers of graphics chips 
to parallelize the computation. However, developing parallel al- 
gorithms for discrete geodesies is technically challenging due to 
the lack of parallel structure; specifically, the geodesic distance is 
propagated from the source to all destinations in a sequential or- 
der. To date, the only parallel geodesic algorithm is that of Weber 
et al. [20081, who developed a raster scan-based version of the fast 
marching algorithm. Although Weber et al.'s method is highly ef- 
ficient, it only computes the first-order approximation of geodesic 
and requires parameterization of the surface into a regular domain. 
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which is usually difficult for surfaces with complicated geometry 
and/or topology. To our knowledge, there is no parallel algorithm 
with which to compute an exact geodesic on triangle meshes. 

Technical challenges The existing exact geodesic algorithms (for 
example, the MMP, CH and ICH algorithms) do not support par- 
allel computing due to the lack of parallel structure. These algo- 
rithms partition each edge into a set of intervals, called windows, 
which are maintained in a queue and then propagated across the 
mesh faces. The propagation step pops a window from the queue 
and then computes its children windows and performs clipping or 
merging if necessary, which can add, modify, or remove existing 
windows, and the queue is updated accordingly. These algorithms 
generate the correct solution regardless of the order in which win- 
dows are removed from the queue; however, selecting windows in 
arbitrary order causes extremely slow performance. 

The MMP algorithm and the ICH algorithm propagate the windows 
as a wavefront by ordering them in a priority queue according to 
their distance from the source. However, maintaining a priority 
queue is a sequential and global process. It is highly non-trivial 
to implement a parallel priority queue on GPUs. The performance 
of the MMP and ICH algorithms drops significantly without the 
priority queue. 

The CH algorithm organizes the windows in a tree data structure 
and propagates the windows in a breadth-first-search sweep from 
the root to the leaves. Due to the data dependency between parents 
and their children nodes, only windows on the same level can be 
processed in parallel. To maintain this global tree data structure, a 
CPU/GPU synchronization must be conducted for each layer. As 
the depth of the tree equals the number of mesh faces, the 0{n) 
times synchronization is too expensive for large models, making 
parallel implementation unpractical. 

Our contributions This paper extends the classic CH algorithm for 
parallel computing the exact geodesic distance on triangle meshes. 
Our algorithm, which is called the Parallel Chen-Han (or PCH) al- 
gorithm, follows the window propagation framework, as all the ex- 
act algorithms do. Our idea is to divide CH's sequential algorithm 
into several phases in such a way that there is no data dependence 
in each phase and the operations within each phase can be done in 
parallel. Our parallel algorithm design considers three key factors: 

(1) propagating the windows in as parallel a manner as possible; 

(2) controlling the total number of windows effectively; and (3) 
avoiding data conflicts during window propagation. The following 
concepts made it possible for us to address these issues. 

• First, in contrast to the existing approaches, which main- 
tain the windows in a "tightly-coupled" structure (such as 
the tree data structure in the CH algorithm) or a "strictly- 
ordered" structure (such as the priority queue in MMP/ICH 
algorithms), the windows in the proposed PCH algorithm are 
loosely organized. Consequently, our method neither sorts the 
windows nor traces the ancestor windows. Therefore, the win- 
dows can be propagated in a fully parallel and independent 
manner. 

• Second, to reduce the total number of windows, the PCH algo- 
rithm adopts a fc-selection to determine the k windows near- 
est to the source, where the parameter k is specified by the 
user. These windows will then be processed independently by 



multiple GPU threads. Together with a parallel fc-selection 
and the window filtering techniques in the CH and ICH al- 
gorithms, our parallel algorithm is very effective for window 
propagation, as well as for controlling the total number of 
windows. Extensive evaluation of real-world models shows 
that our algorithm only has slightly more windows than the 
ICH algorithm, which uses a priority queue to maintain the 
wavefront. 

• Third, during window propagation, if a window brings the 
vertex v a shorter distance than its current distance, our 
method does not immediately update the geodesic distance 
at V immediately, since other threads may also access v at 
the same time and such an update would cause data conflicts. 
Instead, our method triggers an update event. Once all the se- 
lected windows have been propagated, our method processes 
these events and updates the corresponding data. This updat- 
ing can also be done in a parallel manner. 

The proposed PCH algorithm is the first parallel technique for cal- 
culating exact geodesic distance on triangle meshes. We have im- 
plemented our algorithm on modem GPUs and obtained promising 
results on a wide range of models. We have observed that the per- 
formance improvement is highly consistent with the GPUs' double- 
precision performance (GFLOPS). For example, our method can 
compute the exact geodesic distance field on the 1.8-million-face 
Blade model (see Figure 1) in 11 seconds on an Nvidia GTX580 
graphics card, which is an order of magnitude faster than the state 
of the art | Xin and Wang 2009) . This indicates that our algorithm is 
suitable for applications involving intensive geodesic distance com- 
putations. 
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Figure 1: Computing the exact geodesic distance on triangle 
meshes by using the proposed PCH algorithm. Our implementa- 
tion on Nvidia GTX 580 is one magnitude of order faster than [Xin 
and Wang 2009]. 



2 Related Work 

The discrete geodesic problem has been widely studied since mid- 
dle 80s. There are many elegant algorithms to compute the single- 
source-all-destination geodesic, geodesic path, geodesic loop and 
all-pairs geodesic. Due to the limited space, we review only the 



representative work of the single-source-all-destination geodesic, 
which is our major target. 

Exact geodesic algorithms Sharir and Schorr 1 1986| pioneered 
the discrete geodesic algorithm with time complexity 0(n^ log n). 
However, their algorithm applies only to convex polyhedral. Us- 
ing a continuous Dijkstra sweep, Mitchell, Mount and Papadim- 
itriou f 1987 1 presented the first practical algorithm to compute 
geodesic distances on general polyhedral surfaces with time com- 
plexity 0{n^ logn). Liu et al. |2007| improved the robustness of 
the MMP algorithm by presenting effective techniques to handle 
the degenerate cases. Surazhsky et al. |2005| observed the worst 
case running time of the MMP algorithm is rare, and in practice 
the algorithm runs in sub-quadratic time. However, the MMP algo- 
rithm is memory inefficient due to its quadratic memory complex- 
ity O(n^), which diminishes its application to large-scale models. 
Chen and Han 1 1990 1 improved the time complexity to 0(n^) by 
organizing the windows with a tree data structure. As only the 
branch nodes are saved, the CH algorithm has linear space com- 
plexity 0{n). Although the time complexity of the CH algorithm 
is better than that of the MMP algorithm, extensive experiments 
show that the CH algorithm runs much slower than the MMP algo- 
rithm, mainly because most of the windows in the CH algorithm are 
useless, which do not contribute to the shortest distance. Xin and 
Wang 1 2009 1 presented an effective window filtering technique to 
reduce the windows significantly, and used a priority queue to or- 
der the windows according to the minimal distance from the source. 
They demonstrated that their improved CH algorithm (ICH) outper- 
forms the MMP and CH algorithms in terms of both execution time 
and memory cost. Schreiber and Sharir | Schreiber and Sharir 2006] 
presented an 0{n log n)-time algorithm for convex polyhedral sur- 
faces, which reaches the theoretical lower bound. However, the 
optimal time bound for general polyhedral surfaces is still an open 
problem. With the exact geodesic distance, Liu et al. (201 1| sys- 
tematically investigated the analytic structure of iso-contours and 
bisectors on triangle meshes, and proposed efficient algorithm for 
computing geodesic Voronoi diagrams. 

Approximation geodesic algorithms Sethian [19961 proposed the 
O(nlogn) fast marching method for first-order approximation of 
geodesies on regular grids. Kimmel and Sethian |1998| extended 
the fast marching method to triangle meshes. Weber et al. 1 2008 J 
presented a raster scan-based version of the fast marching algo- 
rithm for approximating the geodesic distance on geometry images. 
Thanks to its parallel structure, their approach allows highly effi- 
cient parallelization on modern GPUs. Polthier and Schmies 1 1998J 
proposed the locally straightest geodesic, which differs from the 
conventional locally shortest geodesic. Surazhsky et al. [ 2005) 
presented the approximate MMP algorithm that has optimal time 
complexity 0{nlogn) and computes approximate geodesies with 
bounded error. 

3 Preliminary 

Let Al — {V,E, F) be a triangle mesh representing an orientable 
2-manifold where V, E and F are the vertex, edge and face sets, 
respectively. For a vertex v £V , the total angle is the sum of inte- 
rior angles formed between each pair of neighboring edges incident 
at V. A vertex v is called spherical if its total angle is less than 2ii, 
Euclidean if the total angle equals 2-n, and saddle if the total angle 
is greater than 2-n. Mitchell et al. |1987| proved that there exists 
a geodesic path from the source s, typically a mesh vertex, to any 
other point on the surface. They also showed that a geodesic path 
cannot pass through a spherical vertex except that it is an end point. 

Theorem |Mitchell et al. 1987] The general form of a geodesic 
path is a path that goes through an alternating sequence of vertices 




(a) (b) 

Figure 2: (a) A window represents an interval [A, B] on an ori- 
ented edge e, which is "visible" by the pseudo source I after un- 
ft>lding the corresponding face sequence, (b) The position of the 
pseudo source on the unfolded plane can be determined by do and 



and edges such that the unfolded image of the path along any edge 
sequence is a straight line segment and the angle of the path passing 
through a vertex is greater than or equal to tt. 

If a geodesic path from the source s to destination t passes through 
one or more saddle vertices, we call the vertex, which is nearest to t, 
a pseudo source. Clearly, a geodesic path must be a straight line in- 
side a triangle. When crossing over an edge, the geodesic path must 
also correspond to a straight line if the two adjacent faces are un- 
folded into a common plane. The exact geodesic algorithms parti- 
tion each mesh edge into a set of intervals, called windows ISurazh^ 
[sky et al. 2005 J |Xin and Wang 2009 1, each of which encodes the 
geodesic paths passing through the same face sequence. 

As shown in Figure|2] a window data structure associated to an edge 
e is a 6-tuple (d, ^, B, do, di , e) where 

• d is the distance from the pseudo source to the source s; 

• A and B are the left and right endpoints of the interval; 

• do and di are the distances from the edge endpoints to the 
(pseudo) source. 

An important step in the exact geodesic algorithms is to propagate 
the windows across the adjacent triangle, which yields new win- 
dows on the opposite edge(s). In general, a window at edge e can 
have up to two children at the edges opposite to e. See Figure|3[a)- 
(b). For a special case where the geodesic path passes through a 
saddle vertex v, one or more windows are then created for the edges 
in a fanned area. See Figure[3|c). 

Chen and Han 1 1990] proposed using a tree data structure to keep 
record of the parent-child relationship. The depth of the tree is equal 
to the number of faces. To avoid exponential explosion in the tree, 
Chen and Han adopted a simple "one angle one split" scheme: if 
two windows occupy a vertex, at most one of them can have two 
children. See Figure|4ja). Xin and Wang | 2009| further reduced the 
number of windows by using a strict window filtering technique. 
See Figure |4|b). They also suggested using the priority queue to 
organize the windows according to their distance back to the source. 
Adopting the priority queue together with the two window filtering 
techniques, the ICH algorithm outperforms both the MMP and CH 
algorithms. 

Mitchell et al. f 19871 proved that each edge may have 0{\E\) 
windows and therefore the total number of windows is 0(ji5p), 
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Figure 3: Window propagation results in new windows, (a) A win- 
dow creates one child on the opposite edge, (b) A window creates 
two children on opposite edges, (c) For a saddle point v, the injec- 
tion ray I is split into two directions (dashed grey lines), each is an 
extended straight line of I after unfolding. Then the new windows 
are generated on the opposite edges in the fanned area formed by 
the two directions. 



which is the theoretical upper bound. In practice, Surazhsky et 
al. (2005 1 observed that for typical meshes an edge has an average 
of 0{^\E\) windows. Thus, the key for developing an efficient 
geodesic algorithm is to organize the windows effectively. 
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Figure 4: Window filter (a) The "one angle one split" filter in 
the CH algorithm. Two windows wi and W2 share the same edge 
{vq, Hi) and occupy the vertex V2. However, the pseudo source I\ 
can not provide a shorter distance to V2. Thus, only window W2 can 
generate two children, (b) The filter used in the ICH algorithm. The 
window w on edge [vq, vi) is generated by some window on edge 
[vi, V2). Let Qi be the current shortest distance for vertex Vi, i = 
0, 1, 2. The window w is useless, if one of the following inequalities 
holds: d+ \\IB\\ > go + \\voB\\ or d + \\IA\\ > gi + \\viA\\ or 
d+\\IB\\ >g2 + \\v2B\\. 

4 Computing Geodesies in Parallel 

The proposed parallel Chen-Han algorithm is based on the shared 
memory model so that all of the data (such as mesh data structure, 
geodesic distance, etc.) are assessable to all processors. This sec- 
tion presents our algorithm on modern GPUs, as these are readily 
available to the graphics community. 

4.1 Data Structure 

Our algorithm maintains several global variables, as shown in Ta- 
ble[T| Let Al = {V, E, F) denote the input triangle mesh, where V, 
E and F are the set of vertices, edges and faces. We use the half- 
edge data structure to encode A/'s incidence information. Consid- 
ering the limited GPU memory, we adopt a minimal half-edge data 
structure. Each edge is decomposed into two half-edges with op- 
posite orientations. Each half-edge stores the index of the starting 
vertex, the index of the opposite half-edge and its length. Each ver- 
tex references one outgoing half-edge. 

struct half_edge { 
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Table 1: Global variables. 



int starting_vertex_id; 
int opposite_half_edge_icl; 
double length; 



half_edge he [ ] ; // size 3|F| 
half_edge outgoing_he [ ] ; // size !V| 

Compared to the standard half-edge data structure, our struct 
half-edge does not have the face that it boarders and the next half- 
edge around the face. The complete incidence information is en- 
coded into two arrays, he of dimension 3|F|, which stores all half- 
edges of A/, and outgoingjie, of dimension \ V\, which stores one 
of the half-edges emanating from each vertex. The three half-edges 
of the j-th triangle are stored at /ie[3i], he[?>i + 1] and /ie[3i + 2], 
respectively. Given the half_edge[j], the triangle that it boarders 
is the [j/3j-th triangle and the next-half-edge around that face is 
luilf_edge[3lj /3\ + (j + 1)%3], where [-J is the floor function. 

To access the one-ring neighbors of the i-th vertex Vi, one can sim- 
ply start with the outgoing half-edge outgoing Jie[i] and iteratively 
switch to the opposite half-edge and find the next half-edge that 
points to neighboring vertex. The half-edge data structure, he and 
outgoing _he, are located in the GPU memory and accessible to the 
GPU threads in the real-only manner. The PCH algorithm does not 
require the vertex coordinates, instead, it needs only the metric (the 
edge length). 

double geod_dist[]; // size |V 
window angle_split [ ] ; // size 3 IF 

The array geod_dist[0..\V\-l] stores the distance value at each ver- 
tex. The initial value for each non-source vertex is oo. When the al- 
gorithm terminates, the value geod_dist[i] gives the globally short- 
est geodesic distance at the i-th vertex. 

The array a«g/e^sp/!Y[0..3|F|-l] contains the window that can pro- 
vide the shortest distance to each angle. Such information is used 
by the Chen-Han's "one angle one split" window filter. 

During the wavefront propagation, a large number of windows will 
be created, processed, and then discarded. To allow efficient access 
of the windows, our program creates a memory pool in the GPU 
memory. The memory pool contains all the active windows, which 
are stored continuously so that there are no memory gaps between 
adjacent windows. This condition allows us to distribute the active 
windows evenly to all GPU threads. The memory pool also pro- 
vides each GPU thread with a buffer for the temporary storage of 
the new windows. The global array bujfer_address[Q..T-V\ stores 
the beginning address and size of each buffer. Unlike the contin- 
uous storage of the active windows, these thread buffers are not 
physically continuous in the memory pool. 

4.2 PCH Algorithm 

Let S — {si}^i be the set of source points specified by the user 
For each source point Si, our algorithm creates a window for every 
edge facing Si, and then iteratively processes the windows. 




Figure 5: During each iteration, our algorithm selects k windows 
that are nearest to the source points and assigns these windows to T 
GPU threads. Each window is taken by one GPU thread and then 
propagates independently, which will result in one or more chil- 
dren windows. After all the selected windows have been processed, 
our algorithm collects the newly -generated windows and organize 
all windows such that there are no memory gaps between adjacent 
windows in the memory pool. The empty slots in the memory pool 
are drawn in light gray. 



Each iteration contains the following four steps: First, the algo- 
rithm selects k windows that are nearest to the source points s;, 
i = 1, ■ ■ ■ ,m. Second, the selected k windows are processed by T 
threads in a completely independent manner. The window propaga- 
tion results in new windows, which are stored in each thread's own 
buffer. If a window w can provide a shorter distance for a vertex 
V, our algorithm does not immediately update the corresponding 
entries of geod_dist or angle_split, as doing so would cause con- 
flicts. Instead, the algorithm creates an update event, which will be 
processed later Third, the algorithm collects the new windows and 
organizes them with the existing windows in the memory pool, so 
that there are no memory gaps between the adjacent windows. Fi- 
nally, the algorithm processes these update events and updates the 
arrays geodjdist and anglesplit for the corresponding vertices 
and comer angles. 

These four steps are repeated until all windows in the memory pool 
have been processed. The data flow for each iteration is shown in 
Figure |5] and the pseudo code is shown in Algorithm 1. Next, we 
explain the algorithm in details. 

Initialization (lines 1, 2) In the initialization stage, our algo- 
rithm creates a memory pool, and two global arrays: the first ar- 
ray geod_dist[\V\] contains the geodesic distance at each vertex, 
where the initial value is for the source points and oo for all other 
vertices; the second array angle_split[3\F\] contains the windows 
for each angle, which is used by the "one angle one split" window 
filter. For every edge opposite to the source point Si, a window is 
created and put in the memory pool. 

Step 1. Parallel selection (line 3) Rather than sorting all windows, 
our algorithm selects k windows, which are closest to the source 
points. The fc-selection is a fundamental problem in computer sci- 
ence, which has been widely studied. The state-of-the art parallel 
algorithm is due to Monroe et al. | |201 1) . Their algorithm proceeds 



Algorithm 1: Parallel Chen-Han Algorithm 



Algorithm 2: PropagatingWindow(ui) 



Input: A triangle mesh M — {V, E, F), the set of source points 
S = {si\si £ V,l < i < m}, the selection parameter k, and 
the number of GPU threads T 
Output: The geodesic distance for each vertex 
1: initialize the global variables 

2: parallel create a window for every edge opposite to the source 

points 
3: Repeat 

4: parallel select k nearest windows 

5: parallel propagate the selected windows 

6: organize the newly-generated windows and update events 

7: parallel process the update events 

8: Until all windows are processed 



via an iterative probabilistic guess-and-check process on pivots for 
a three-way partition. When the guess is correct, the problem is 
reduced to selection on a much smaller set. They proved that their 
probabilistic algorithm always gives a correct result and always ter- 
minates. We adopt Monroe et al.'s parallel algorithm in our GPU 
implementation. We also develop an approximate fc-selection to im- 
prove the speed. We choose fc ^ T to fully utilize the pipeline of 
the GPU cores. The details of optimal parameter selection and the 
approximated version of fc-selection will be presented in Sec. |5.2[ 

Step 2. Parallel window propagation (line 4) All of the selected 
windows are evenly distributed to the T GPU threads. Some win- 
dows may generate one or more children, while others may not. 
Statistics on a wide range of real-world models show that each win- 
dow generates an average maximum of 3.8 childrerQ Therefore, 
we allocate each GPU thread a 4 -sized buffer in the memory 
pool to contain the new windows. This fixed-size buffer works well 
in practice and we do not observe any buffer overflow in our ex- 
periments. In case of an overflow in the buffer of the i-th thread, 
for example, the thread i creates a new buffer of doubled size, 
copies the contents to the new buffer, and then updates the entry 
bujfer_address[i\. These buffers are not required to be consecutive 
in the memory, since they store the new windows temporarily. 

Note that during the window propagation, if a new window 
can provide the vertex v a shorter distance than its current one 
geod-dist[v\, we should update geod-dist[v\ accordingly. How- 
ever, this would cause data conflicts, as other GPU threads may also 
access v at the same time. To avoid conflicts, we create a distance 
update event for v. Similarly, if a window w occupies the vertex 
V and provides v a shorter distance than that of the window in the 
corresponding entry of angle_split, we also create an angle update 
event. All these events will be processed later. The pseudo code for 
window propagation is shown in function Propagating Window(). 

Step 3. Data organization (Une 5) In the above window propaga- 
tion step, each GPU thread Ti generates a certain number of win- 
dows, which are stored in Ti's own buffer. The purpose of this step 
is to organize these newly-generated windows in the memory pool 
such that there are no memory gaps between adjacent windows, 
which is a critical condition for the parallel selection and parallel 
updating. 

Since each GPU thread counts the number of new windows inde- 
pendently, we use 0{T) space to store these numbers, where T is 



'in general, each window has at most two children during the propa- 
gation. See Figure |3] However, a saddle vertex can produce more than 2 
windows. The worse case is that one window is created for each edge oppo- 
site to the saddle vertex. Our statistics show that when processing L(> 8) 
windows in parallel, the number of new windows is less than 3.8-L. 



// assume w — (d, A, B, do,di, e) is on the oriented 
// edge e — (wo, vi ) and V2 is the opposite vertex of e; 
if A = vq && do + d < geod_dist[vo] then 

// w provides a short distance to vo ; 

create a distance update event {vo, do + d); 

if Vo is a saddle vertex then 
|_ create windows in the fanned area (see Fig.|3|c)); 

if B = vi && di + d < geod_dist[vi] then 
// w provides a short distance to vi ; 
create a distance update event {vi,di + d); 
if Vi is a saddle vertex then 
|_ create windows in the fanned area (see Fig.|3fc)); 

if w occupies V2 then 

// apply the one-angle-one-split filter; 

if w provides V2 a shorter distance than angle_split[v2] 

then 

create an angle update event (e, lu); 
create w's two children (see Fig.[3|b)); 

else 

// w has only one child on edge {vo,V2) or (t;i, 112); 
create w's child (see Fig.|4ja)); 

if ||/ti2|| + d < geod_dist[v2] then 

create a distance update event {v2, \\Iv2\\ + d); 
if V2 is a saddle vertex then 
1^ create windows in the fanned area (see Fig.[3fc)); 

else 

// w has only one child on edge {vo,V2) or (vi, V2); 

create w's child (see Fig.|3[a)); 
apply the ICH filter for the new windows and store the useful 
windows in the buffer (see Fig.|4jb)); 



the number of GPU threads. We then compute the accumulated 
number of new windows for each thread. Finally, each GPU thread 
copies its new windows from its buffer to the memory pool. The 
pseudo code for data organization is shown in function Organiz- 
ingData(). We use the same procedure to organize the generated 
distance and angle update events. 



Algorithms: ReorganizeData({ciataCoitnt}, {data}, addr) 

II datai. the buffer storing the output of the i-th thread; 
// dataCounti. the size of datat; 

II addr: the address of the first free slot in the memory pool; 

cumo <r- ; 

for i = 1 ?o T - 1 do 

|_ cumi cumi-\ + dataCounti-i 

parallel for each thread i do 
|_ copy datai to addr + currii 



Step 4. Processing the update events (line 6) The window prop- 
agating step produces distance and angle update events, which are 
organized in step 3 such that they are stored in the memory pool in 
a seamless manner This allows us to parallel process these events. 

An event is a 2-tuple {key, value). For the distance update event, 
key is the vertex id, and value is the geodesic distance. For the 
angle update event, key is id of the oriented edge, which determines 
the opposite angle/vertex, say v, and value is a window occupies 

V. 



To avoid the conflicts in events updating, we must sort all events by 
their keys in increasing or decreasing order. If two keys tie, we com- 
pare the two events by their values. For example, for two distance 
events {vi, d!) and {vi, d") occurring at the same vertex Vi, the one 
with smaller distance wins. To break a tie between two windows, 
we compare the shortest distance from the source to the correspond- 
ing interval. This tie-breaking enables us to process only the first 
event at a vertex or angle, since all the subsequent events occurring 
at the same vertex or angle do not carry useful information. We then 
assign the ordered events evenly to T GPU threads, which update 
the corresponding vertices or angles. The pseudo code is shown in 
function ProcessingEvents(). 



Algorithm 4: ProcessingEvents({euent}, {Data}) 
II eventi. a 2-tuple {key, value}; 

II {Data}: the to-be-updated array geodjdist or angle_split; 
thrust sort {event} by key and use value for tie breaking; 
S <— size of {event}; 
parallel for each thread i do 

for j = ix \§] to (i + 1) X \§] - 1 do 
it event j .key ^ event j-i.key then 
|_ Data[eventj.key] ■(— event j .value; 



4.3 Correctness 

The correctness of our parallel algorithm is based on two obser- 
vations. First, during the window propagation, we adopt Chen and 
Han's "one angle one split" and Xin and Wang's filtering techniques 
to reduce the total number of windows. It is shown | |Chen and Han| 
|1990|prin and Wang 20 09] that both techniques remove only the 
useless windows, which do not carry the shortest distance. Sec- 
ond, Mitchell et al. (1987) proved that there exists a geodesic path 
from the source s to any other vertex. Our window propagation step 
keeps the iisefid windows, which carry the shortest distance. Thus, 
for a vertex v, we can always find a sequence of windows wi, W2, 
■ ■ ■ , TO; , to encode the geodesic path, where wi is the last window 
on the edge e opposite to v such that wi occupies the vertex v. So 
Wi finally assigns the geodesic distance at v. Regardless of the or- 
der in which the windows are processed, such window sequence 
always exists. Thus, the resulting distance field is correct. 

5 Implementation & Experimental Results 
5.1 Implementation Details 

We implemented our parallel geodesic algorithm on a 64-bit PC 
with an Intel Xeon 2.66GHz CPU and 12GB memory. The graphics 
card is an Nvidia GTX 580 with 512 cores and 1.5GB memory. Our 
program is compiled using CUDA 4.2. 

The input mesh is maintained by using the half-edge data structure. 
Only one copy of the mesh structure is transferred to GPU mem- 
ory. More specifically, we need only an array of half-edges, each 
of which has the length information. We do not need to store the 
vertex in the GPU memory; the face connectivity is also encoded in 
the half-edge structure. 

Our algorithm requires a parallel selection to choose the k nearest 
windows and a parallel sorting to order all update events by their 
keys and values. Selection and sorting can be fully implemented 
in parallel. We adopted the probabilistic parallel selection [Monroe 
|etal. 2 011 1 and CUDA's thrust sorting [Bell and Hoberock 201 1 1 in 
our implementation, but, surprisingly, observed poor performance. 



As shown in Figure |6] (a), the GPU selection and sorting are very 
slow for the 144K-face Bunny, taking more than 80% of the exe- 
cution time. This slowness is mainly because the GPU selection 
algorithms require a large amount of CPU/GPU synchronization. 
As a result, these algorithms obtain good performance only for a 
very large number of elements (for example, more than 1 million). 
To improve the performance, we adopted the following simple yet 
effective strategies in our implementation. 

First, we implement an approximate fc-selection on the GPU. As- 
sume that there are A''(> k) windows, wi, - ■ ■ , uijv, in the memory 
pool and that there are T threads available. We allocate the follow- 
ing windows Wi, Wi+r, Wi+2T, ■ • • , to the i-th thread, which en- 
ables each thread to obtain approximately \N/T] windows, among 
which the \k/T] nearest windows are selected. Thus, T threads 
select k windows in total. This approximate selection uses only a 
single GPU launch and can significantly reduce the execution time. 

Second, since GPU sorting is not efficient, we copy the update 
events from GPU to CPU. To minimize the CPU/GPU data transfer, 
we use cudaHostAllocO to allocate the memory for global variables 
geod^distW and angle^splitW. The memory is mapped to CUDA 
address space using cudaHostAUocMapped, so both CPU and GPU 
can access it. All writing operations on such global variables are 
performed by CPU and the GPU procedure PropagatingWindow() 
accesses these data in a read-only manner. Although this CPU op- 
eration makes the event updating step sequential, we find it is much 
more efficient than the GPU sorting and updating and improves the 
overall performance significantly. See Figure[6fb). 

Remark 1. Our PCH algorithm requires the windows to be stored 
continuously in the physical memory. This condition plays an im- 
portant role in the parallel selection step, where the i-th thread 
checks the i, i + T, i + 2T, ■ • ■ , positions in the memory pool, and 
then selects the window nearest to the source. The continuously 
stored windows guarantee that each thread requires approximately 
the same number of windows, which means that the workload is 
balanced. Also, because different threads may generate different 
numbers of windows during the window propagation step, the pro- 
cedure ReorganizeDataO is required in order to reorganize the new 
windows and ensure the condition for the next-round parallel selec- 
tion. 

Remark 2. Although our implementation adopts the approximation 
k-selection to choose the windows, the correctness of the computed 
geodesic distance is guaranteed, since our algorithm does not delete 
any useful windows and the correctness of the geodesic distance 
is independent of the order in which the windows are processed. 
Therefore, the resulting geodesic distance is exact. 

Remark 3. One of the key factors in designing a parallel algorithm 
is to effectively avoid data conflicts. Because our PCH algorithm 
propagates the windows and then updates the arrays geod_dist and 
angle^split in parallel, it is critical to maintain the data consistency 
in these arrays. Mutex and atomic operations are two commonly 
used techniques to access the shared memory in parallel comput- 
ing. However, these techniques are not suitable to the proposed 
PCH algorithm, which requires frequent access of the global ar- 
rays geod_dist and angle_split: each call of PropagatingWindow 
reads geodjdist three times and angle^dist once. It may also write 
geod_dist up to three times and angle_dist once. 

The large size of both arrays makes it expensive to set a mutex lock 
for each element. More importantly, by using mutex lock, each 
read/write operation locks the corresponding resources to avoid the 
access by other threads, which will significantly compromise the 
performance. On the other hand, using the atomic operations in 
PCH is not effective either, since doing so when reading/writing the 
two global arrays will slow down the procedure PropagatingWin- 
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(a) GPU exact selection & GPU event update (b) GPU approximate selection & CPU event update (c) Exact geodesic distance 

Figure 6: The window selection and event sorting on the GPU take 3 to 6 seconds on the 144K-face Bunny model. We implement an 
approximate selection on the GPU and process the events on the CPU, which can significantly reduce the overall execution time. The vertical 
axes are the execution time in seconds and the total number of windows. The horizontal axes are the selection parameter k. Although the 
window selection is approximate, the computed geodesic distance is exact. 



dow - the most expensive of the four steps - thereby compromising 
the performance of the entire program. 

Therefore, our implementation does not contain the mutex or 
atomic operations. Instead, we adopt the delayed update strategy 
to avoid data conflicts. Note that the procedure PropagatingWin- 
dow does not update geod^dist and angle^dist. Thanks to the read- 
only operations in PropagatingWindow, all the selected k windows 
can be propagated without any data dependence. Then, in the Pro- 
cessingEvents procedure, we sort the distance and angle updating 
events to detect and delete the conflicted data, after which the ar- 
rays can be updated in parallel. This delayed update strategy works 
quite well in practice, which is justified by the promising experi- 
mental results. 

5.2 Parameter Setting 

The performance of our PCH algorithm depends on the mesh com- 
plexity n = \ V\, the GPU's computational power (such as the num- 
ber of cores and frequency), and the selection parameter fc, which 
specifies the number of nearest windows to be processed by GPU 
threads in parallel. 

The value of k is closely related to the total number of windows 
produced in the wavefront sweep. Both the MMP algorithm and 
the ICH algorithm use a priority queue to determine which window 
is nearest to the source. Xin and Wang 1 2009 1 showed that the prior- 
ity queue is very effective in terms of controlling the total number 
of windows. If A: = 1, the fc-selection becomes a priority queue, 
leading to a small number of windows. For such a case, however, 
our program becomes a sequential program, as only one window is 
processed at a time. In order to take full advantage of the parallel 
nature of our algorithm, a large k is usually preferred, so that mul- 
tiple windows can be processed independently. On the other hand, 
for a sufficiently large k, the total number of windows increases dra- 
matically. For example, if k is larger than the total number of win- 
dows, then all windows are selected and processed simultaneously, 
which means that the from-near-to-far wavefront is not maintained 
at all. As observed in both the MMP and CH algorithms, such cases 
will result in an extremely large number of windows and, therefore, 
very poor performance. Consequently, there is a trade-off between 
controlling the total number of windows and utilizing the GPU's 
parallel structure. We tested various fc on a large number of mod- 
els and observed that the total number of windows is almost linear 
to the parameter fc; the smaller the fc, the fewer windows, and vice 
versa. Figure |7] shows that the performance curves for Lucy, Bunny 
and Golf Ball, and the other models have very similar patterns. We 
found the optimal range for fc is [16 x 2^", 24 x 2^"], which leads 



to the least execution time. 

We also investigated the relationship between the optimal range of 
fc and the mesh resolution. As Figure[8]shows, all the performance 
curves have the similar pattern, which reveals the trade-off between 
utilizing the GPU's parallel structure and obtaining the optimal per- 
formance on the Bunny model of various resolutions. We found the 
optimal range of fc is insensitive to the mesh resolution. 

5.3 Performance 

We tested our program on a wide range of 3D models, from the 
small Bunny (144K faces) to the large-scale Dragon (4 million 
faces). We observed that the execution time of the MMP/ICH al- 
gorithms and ours depends on the location of the source point, and 
that different source points may result in up to 10% — 15% time 
difference because the total number of windows is dependent of the 
source point. To obtain a consistent evaluation, we repeated our test 
100 times, where a random source point is generated each time. Ta- 
ble |3] shows the mesh complexity and the average execution time, 
memory cost, the number of windows, etc. The consistent experi- 
mental results show that our method obtains an order-of-magnitude 
improvement for models with more than 500K faces. Our algo- 
rithm is also flexible in terms of computing the "multiple-source- 
all-destination" geodesic distance. Table |2] shows the performance 
of our algorithm and the ICH algorithm on the Golf Ball and Pega- 
sus. 



# source 
point,s 


Golf ball 


Pegasus 


TjCH 




TiCH /TpCH 


TiCH 


TpCH 




1 


23.71 


2.51 


9.45 


32.61 


3.33 


9.79 


2 


22.02 


2.30 


9.59 


33.10 


3.04 


10.80 


4 


19.36 


2.01 


9.65 


29.94 


2.83 


10.56 


8 


14.61 


1.46 


10.04 


34.55 


3.16 


10.94 


16 


11.17 


1.02 


10.90 


31.67 


3.40 


9.33 


32 


8.49 


0.69 


12.32 


26.02 


2.83 


9.21 


64 


5.99 


0.50 


11.97 


24.33 


2.34 


10.41 


128 


4.60 


0.37 


12.53 


20.08 


1.99 


10.10 


256 


3.62 


0.29 


12.65 


15.39 


1.57 


9.81 


512 


2.85 


0.23 


12.46 


12.68 


1.25 


10.17 


1, 024 


2.36 


0.20 


12.06 


9.95 


0.98 


10.15 



Table 2: Statistics of "multiple -source-all-destination" distance. 
Execution time was measured in seconds. 

Figure[6jb) shows that GPU processing takes more than 75% of the 



Models 


# faces 


MMP 


ICH 


PCH 


Speedup 

TiCH /TpCH 


Time 
(s) 


Peak 
mem. (MB) 


Time 
(s) 


Peak 
mem. (MB) 


Total 
#windows 


Time 
(s) 


Peak 
mem. (MB) 


Total 
# windows 


Peak 
# windows 


Worst 


Average 


Bunny 


144, 036 


8.19 


250.49 


6.24 


7.04 


6,910,123 


0.95 


15.10 


8,912,758 


47,729 


5.28 


6.57 


Golf ball 


245, 760 


26.86 


676.32 


26.10 


12.61 


27, 770, 244 


2.49 


22.50 


28, 541, 081 


59,218 


9.78 


10.48 


Lion 


305,608 


17.18 


476.36 


12.60 


14.43 


13,986,984 


1.51 


26.80 


16, 315, 130 


54,916 


7.45 


8.34 


Sphere 


327, 680 


86.94 


1, 894.02 


70.39 


21.55 


81,022,375 


8.11 


37.08 


82,412,805 


201, 499 


8.51 


8.68 


Armadillo 


345, 944 


13.25 


412.36 


10.34 


15.91 


11, 169,475 


1.40 


28.51 


13, 159, 141 


35,599 


6.04 


7.39 


Lucy 


525, 814 


22.27 


627.03 


18.20 


22.17 


21, 590, 268 


1.71 


42.93 


22, 996, 519 


57,691 


7.43 


10.64 


Pegasus 


667, 474 


49.80 


1, 224.35 


33.69 


27.47 


28, 471, 897 


3.09 


53.15 


31,632,899 


56,460 


8.55 


10.90 


Buddha 


1,439, 116 


179.67 


3,698.71 


160.70 


67.42 


113, 292, 750 


15.20 


114.77 


122,402,062 


139, 761 


7.95 


10.57 


Ramesses 


1, 652, 528 


76.43 


1, 734.26 


52.09 


57.90 


47, 374, 020 


5.79 


124.72 


53,404,855 


55,083 


7.25 


8.99 


Gargoyle 


1,726,398 


211.25 


4,096.23 


179.90 


71.30 


119,568,420 


17.50 


139.77 


137,773,318 


208, 614 


8.47 


10.28 


Blade 


1,765,388 


205.91 


3, 725.16 


122.80 


72.12 


90, 044, 841 


11.00 


137.67 


97, 250, 102 


134, 014 


8.82 


11.16 


Isidore horse 


2,208,936 


80.85 


2, 026.17 


67.77 


69.04 


56,403,922 


6.31 


165.04 


68,469,783 


54, 294 


8.95 


10.74 


Dragon 


4, 000, 000 


534.53 


8,311.39 


421.20 


174.37 


265,327,798 


41.32 


308.40 


325,421,513 


240,455 


8.63 


10.19 



Table 3: Model complexity and performance. Tests were repeated 100 times with random source point and the mean values are reported in 
this table. The last column shows both the worst and average speedup. 



total execution time, while the corresponding ratio for large-scale 
models may exceed 90%. As all windows are processed completely 
in GPU, the performance of our program depends heavily on the 
GPU's parallel computing capacity. We verified this by testing our 
program on three Nvidia graphics card: GTX 285, GTX 470 and 
GTX 580. As Figure [To] shows, we observed that the performance 
improvement Tjch /Tpch is highly consistent with GPU double- 
precision performance (GFLOPS). Accordingly, we believe that our 
algorithm can achieve better performance on the next-generation 
graphics card. 

In addition to performance, we also measured the peak memor}|^ 
of the MMP, ICH, and PCH algorithms. In order to avoid tiny 
windows, we adopt the same tolerance 10"'^ for all the three al- 
gorithms. As Table [3] shows, the memory cost of our algorithm 
is much smaller than the MMP algorithm, which has O(n^) space 
complexity. Our memory cost is 1.5 — 2 times higher than the ICH 
algorithm due to the greater number of windows generated, as well 
as the fact that some extra space is required to store the distance and 
angle update events. 

6 Comparison & Discussion 

This section compares our method to the other exact geodesic al- 
gorithms, including the MMP algorithm, the CH algorithm, and the 
ICH algorithm. As mentioned before, all these algorithms do not 
have parallel structure. 

Comparison to the MMP algorithm The MMP algorithm takes a 
continuous Dijkstra sweep and maintains the windows on the wave- 
front using a priority queue. It is highly non-trivial to maintain the 
priority queue in parallel. The MMP algorithm requires windows 
clipping during the window propagation. The strong dependency 
among the windows makes it difficult to process the windows in 
parallel. The MMP algorithm has Oiji?) space complexity, which 
diminishes its application on the GPU because the GPU usually has 
much less RAM than CPU. Unlike the MMP algorithm, our method 
does not require the priority queue and window clipping operation. 
Our algorithm also has linear memory complexity. 

Comparison to the CH algorithm Instead of using a priority 
queue, the CH algorithm organizes the windows in a tree structure 

-The peak memory of PCH is the actual GPU memory taken by our 
algorithm, rather than the size of the memory pool. 



and propagates windows from the root to the leaves. Because only 
the branch nodes are retained, the CH algorithm has linear mem- 
ory complexity 0{n). Chen and Han also proposed a "one angle 
one split" scheme to avoid exponential explosion. Whenever a new 
window w occupies the vertex v and provides v a shorter distance, 
the original sub-tree at v is deleted. Due to the parent-child data 
dependency, only windows on the same level can be processed in 
parallel. As a result, a CPU/GPU synchronization must be con- 
ducted for each layer. Note that the depth of the tree is equal to 
the number of mesh faces. Such 0{n) synchronization is very time 
consuming, which means it is not practical to parallelize the CH 
algorithm. 

Our algorithm also adopts the "one angle one split" scheme to iden- 
tify the useless windows, which do not carry the shortest distance. 
However, our algorithm neither has the tree structure nor maintains 
the parent/child relation. This allows the windows to be propagated 
in a fully independent manner. 

Comparison to the ICH algorithm The ICH algorithm is a variant 
of the CH algorithm, which adopts an effective window filtering 
technique to reduce the number of windows. Like the original CH 
algorithm, the ICH algorithm maintains the parent-child relation 
and the tree structure. Unlike the original algorithm, however, the 
window in our algorithm does not contain its parent information 
and no windows are sorted. This feature allows us to propagate a 
large number of windows in a completely independent manner. 

The ICH algorithm also uses the priority queue to maintain the 
wavefront, which is able to minimize the total number of win- 
dows. Instead of the priority queue, our algorithm uses a simple 
fc-selection to determine the windows, which are close to the source 
points. This strategy is very effective in terms of controlling the to- 
tal number of windows. Our experimental results show that the total 
number of windows in our parallel algorithm is only slightly higher 
than that of the ICH algorithm. 

The PCH algorithm on the CPU We also implement our PCH 
algorithm on the CPU in such a way that all the parallel procedures 
are replaced by their sequential counterparts. We adopt quickselec- 
tion for the sequential fc-selection, which selects the top k windows 
without ordering them, and then compare its performance to the 
ICH algorithm on the same hardware setting. We observe that the 
CPU PCH algorithm is 5-30% slower than the ICH algorithm for 
all test models. 
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Figure 7: Testing the selection parameter k on an Nvidia GTX 580 
with 512 cores. The total number of windows (in red) is approxi- 
mately linear proportional to the selection parameter k ( the hori- 
zontal axis). Choosing a small k does not utilize the CPU's parallel 
structure at all. On the other hand, choosing a very large k would 
produce too many useless windows, resulting in slow performance 
(see the blue timing curves). We found the optimal range for k is 
[16 X 2^", 24 X 2^"], which leads to the least execution time. 




Figure 8: Civen the Bunny model of various resolutions, we found 
the optimal range of k is insensitive to the mesh resolution. 




Figure 9: More results. 
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Figure 10: The performance improvement Tich /Tpch is consis- 
tent with the CPU double precision performance (CFLOPS). 



The priority queue is known to play a critical role in the ICH al- 
gorithm, since it guarantees that the to-be-processed window is the 
closest to the source among all active windows. Strictly ordering 
the windows by their distances to the source is a very effective way 
to control the total number of windows. The CPU PCH algorithm 
selects the top k windows and then propagate them one by one. 
Because the selected k windows are not ordered, the children win- 
dows of a far-but-early-processed window may be over-ridden by a 
close-but-late-processed window. This means that the CPU PCH al- 
gorithm must process process 10-40% more windows than the ICH 
algorithm, resulting in slower performance. 

The GPU PCH algorithm processes the same number of windows 
as the CPU PCH algorithm. However, the overhead cost of process- 
ing the extra windows is very small compared to the performance 
speedup by the high throughput of the GPU. Therefore, the GPU 
PCH algorithm significantly outperforms its CPU counterpart and 
the ICH algorithm. 

Window deletion & complexity analysis The CH algorithm or- 
ganizes the windows in a tree data structure. Chen and Han (19901 
suggested that once a window becomes useless, the useless window 
and its sub-tree should be deleted (or tagged) immediately, which 
can guarantee the total number of windows is O(n^), leading to an 
O(n^) time complexity. Our PCH algorithm does not require the 
tree data structure to maintain the parent-children relation among 
windows. Also, it does not delete the useless windows until the 
update process. These features significantly reduce the data depen- 
dence, thereby providing great flexibility in the parallel algorithm 
design. However, the price to of such flexibility is the lack of rigor- 
ous complexity analysis. 

Because time complexity is linearly proportional to the total num- 
ber of windows, we empirically evaluated the time complexity by 
counting the total number of windows. As Table |3] shows, we ob- 
served that the total number of windows of the PCH algorithm is 
only 5%-20% more than that of the ICH algorithm. This implies 
that such a delay in windows deletion only has a very minor effect 
on the time complexity. As Table[3]shows, our PCH algorithm can 
achieve 8x-10x speedup for most of the models, which justifies the 
good performance of our algorithm. 

7 Conclusion & Future Work 

It is highly desirable to develop a parallel algorithm to compute 
exact geodesic on triangle meshes. However, the existing exact 
geodesic algorithms have very strong data dependence and lack the 
parallel solution. This paper presents the PCH algorithm, which ex- 
tends the classic Chen-Han algorithm to parallel setting. Our idea 
is to divide the sequential CH algorithm into four phases, namely 
window selection, window propagation, data re-organization and 
events processing, such that the operations in each phase have no 
dependence or conflicts and can be done in parallel. In contrast 
to the original CH algorithm, the proposed PCH algorithm neither 
maintains the windows in a tightly coupled manner nor sorts the 
windows according to their distance to the source. As a result, it can 
process a large number of windows simultaneously and indepen- 
dently. We also adopt a simple selection strategy which is effective 
to control the total number of windows. We implemented our par- 
allel geodesic algorithm on modern GPUs (e.g., Nvidia GTX 580) 
and observed that the performance improvement (compared to the 
conventional sequential algorithms) is highly consistent with GPU 
double precision performance (GFLOPS), which justifies the par- 
allel nature of our algorithm. Extensive experiments on real-world 
models demonstrate an order of magnitude improvement in execu- 
tion time compared to the state-of-the-art. 



effective windows filtering techniques to further reduce the total 
number of windows, thus, improving the overall performance. Sec- 
ond, our PCH algorithm naturally supports the multiple-source-all- 
destination geodesic distance, thus, it has the potential to improve 
the performance of the geodesic Voronoi diagram [ Liu et al. 201 1| . 
Third, as the first parallel geodesic algorithm with good perfor- 
mance, the PCH algorithm is highly desired in many time-critical 
applications which require extensive computation of geodesies. We 
will investigate such applications in the near future. 
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There are a few interesting future directions. First, we will study 



